ABSTRACT

The purposes of this paper are five-fold, to discuss: (1)
when item response theory (IRT) equating methods should provide
better results than traditional methods; (2) which IRT model, the
three-parameter logistic or the one-parameter logistic (Rasch), is
the most reasonable to use; (3) what unique contributions IRT methods
can offer the equating process; (4) what work has been done that
relates to the confidence that can be placed in the IRT equating
results; and (5) what unresolved issues exist in the application of
IRT to equating. Several issues are discussed to provide a
background: formal definitions and requirements of equating; the
basic principle of IRT equating; procedures for linking parameter
estimates and deriving estimated true and observed score equatings
using IRT; the practical advantages to be gained from using IRT
equating; and the important distinction between test development and
test analysis activities. (Author/Bi)
Introduction



Large scale testing programs are often involved in either of two situations
that necessitate a consideration of the process of equating. In the first
situation, a test has been constructed to measure a particular attribute,
aptitude, or ability at some defined level of proficiency, and for a variety of
reasons, most of them related to test security, multiple forms of the test are
necessary. As well defined as the content and statistical specifications
for a test may be, it is usually impossible to construct multiple forms of the
test at exactly the same difficulty level. Since students taking different test
forms are usually either competing with each other for certain desired outcomes
or being judged as masters or non-masters of the test content vis-a-vis a
cut-off point, it is critical that a method of equating, or rendering comparable,
the scores, or the cut-off points, on multiple forms of a test be considered.
When the forms to be equated test content at the same difficulty level, the
process has been referred to in the literature and in practice as horizontal
equating.

In the second situation, the testing program is interested in establishing
a single scale that allows measurements to be compared for various levels of a
defined attribute, aptitude, or ability; there may or may not be multiple
forms of the test at the same level. For instance, many of the commercially
marketed test batteries have tests developed for various grade levels (for
example, third, fifth, and seventh grade). Because aggregate scores are often
compared across levels (e.g., for program evaluation purposes), it is critical
that scores obtained on the various levels of the test be equated, i.e., placed
on a common underlying scale. This sort of equating, referred to as vertical
equating, is designed to convert to one single scale the scores on multiple
forms of a test each designed to measure a different level of the same attribute.

It should be noted that the intended product of both horizontal and
vertical equating is obtained scores on multiple test forms that are on the same
scale. In the case of horizontal equating, the forms to be equated have been
constructed to be identical in difficulty level but differ for unintended
reasons, while in vertical equating situations, the forms to be equated have
been intentionally constructed to differ, often substantially, in both content
and difficulty level. As Slinde and Linn (1977) point out, "It is no surprise
that the problem of vertical equating is substantially more difficult and
conceptually hazardous than that of horizontal equating."

A commonly accepted way of viewing equating is that scores on two different
forms of a test may be considered equivalent if their frequency distributions
for a particular group of examinees are identical. This type of equating,
referred to as equipercentile equating (see Angoff, 1971), can be accomplished
by setting equal raw scores on two forms of a test that have the same percentile
rank for the group of examinees. Such a process leads to a consideration of
the extent to which the test forms being equated differ in difficulty and the
effect this has on the shape of the raw score distributions when the same group
of examinees takes both test forms. If the test forms differ considerably in
difficulty, the frequency distributions of the raw scores on the two forms will
differ considerably in shape. If the distributions of raw scores on the two
forms are forced to have the same shape (by equipercentile equating), then the
raw score scale on one of the forms must be stretched and condensed to the
extent that all moments of the distribution are transformed, and the resulting
relationship between raw scores on the two forms will be curvilinear. If,
however, the tests are very similar in level of difficulty, the shapes of the
two raw score distributions should differ only in the first two moments when
administered to the same group of examinees. To effect a change in only the
first two moments, thereby bringing the raw score distributions into coincidence,
a linear transformation may be used. The equating is done by setting equal the
standard deviates for scores on the two test forms, resulting in an equation
which expresses the linear relationship between the raw scores on the two
forms. Of course, equipercentile equating used in this situation will also
result in a linear relationship between raw scores on the two test forms, i.e.,
equipercentile methods applied to two raw score distributions that differ only
in their first and second moments will transform only these moments. Evident
from this discussion is that equipercentile methods should be used for most
vertical equating situations (i.e., the raw score distributions on the two forms
differ in more than the first and second moments), whereas linear or
equipercentile methods may be appropriate for horizontal applications. Jaeger
(1981) has offered some procedures for choosing between linear and equipercentile
methods in horizontal equating situations.
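
As an illustrative sketch only (not code from the paper), the two conventional
methods just described can be written in a few lines of Python. The function
names and the mid-percentile-rank convention are assumptions made for this
example; operational equating normally adds smoothing of the score
distributions, which is omitted here.

```python
from bisect import bisect_left, bisect_right
from statistics import mean, pstdev


def linear_equate(x, scores_x, scores_y):
    """Map raw score x on form X to form Y by setting standard deviates equal."""
    mx, my = mean(scores_x), mean(scores_y)
    sx, sy = pstdev(scores_x), pstdev(scores_y)
    return my + sy * (x - mx) / sx


def percentile_rank(x, sorted_scores):
    """Percent of scores below x plus half the percent exactly at x (assumed convention)."""
    below = bisect_left(sorted_scores, x)
    at = bisect_right(sorted_scores, x) - below
    return 100.0 * (below + 0.5 * at) / len(sorted_scores)


def equipercentile_equate(x, scores_x, scores_y):
    """Find the form Y score whose percentile rank equals that of x on form X."""
    xs, ys = sorted(scores_x), sorted(scores_y)
    target = percentile_rank(x, xs)
    points = sorted(set(ys))                       # distinct Y scores
    ranks = [percentile_rank(y, ys) for y in points]
    if target <= ranks[0]:
        return float(points[0])
    if target >= ranks[-1]:
        return float(points[-1])
    # Interpolate linearly between the two Y scores that bracket the target rank.
    for k in range(1, len(points)):
        if ranks[k] >= target:
            frac = (target - ranks[k - 1]) / (ranks[k] - ranks[k - 1])
            return points[k - 1] + frac * (points[k] - points[k - 1])
    return float(points[-1])
```

When the two score distributions differ only in mean and standard deviation,
both functions return the same conversion, which is the point made in the
paragraph above; they diverge once the distributions differ in shape.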

It should be noted that, in the above discussion, the same group of
examinees was considered to have taken both test forms, thereby controlling
for possible differences in ability of the groups involved in the equating
process. In reality, it is usually not the case that the same group takes
both forms. Usually different groups or samples of examinees of potentially
differing abilities take test forms of varying degrees of difficulty. A
common item block or anchor test is administered as a portion of, or along with,
each form as a measure of the difference in ability between the two groups. It
is this situation (differences in test difficulties "contaminated" by
differences in examinee abilities) that has profound implications for the use of
traditional equating methods, particularly when forms of quite different
difficulties are given to groups or samples that are quite disparate in ability
(the usual vertical equating situation). Slinde and Linn (1977) have discussed in
some detail the use of traditional methods in vertical equating situations and
the inherent problems.

The interest in item response theory (IRT) during the past decade has
focused researchers' attention on the advantages, both theoretical and practical,
that IRT might offer to the equating process. Recently a number of research
studies investigating the feasibility of using IRT equating have been performed.
Also, a number of large scale testing programs are either presently using IRT
equating methods or contemplating their use in the near future. Therefore it
was deemed useful at this point in time to summarize both what has been learned
thus far and what we still need to learn about the use of IRT equating methods.

The purposes of this paper are fivefold, to discuss: 1) when IRT equating
methods should provide better results than traditional methods, and when traditional
methods should suffice, 2) in those instances when IRT methods should provide
better results, which IRT model, the three-parameter logistic or the one-parameter
logistic (Rasch), is the most reasonable to use, 3) what unique contributions
IRT methods can offer the equating process, 4) what work has been done, at ETS
and elsewhere, that relates to the confidence that can be placed in the IRT
equating results, and 5) what unresolved issues exist in the application of item
response theory to the problem of equating tests.

In order to accomplish these purposes, a number of background topics will
first be discussed; these include 1) the formal definitions and requirements of
equating (Angoff, 1971; Lord, 1977, 1980) and the implications of these definitions
for the equating that is normally done, 2) the basic principle of IRT equating,
and the theoretical advantages it offers over traditional methods, 3) basic
procedures for linking parameter estimates and deriving estimated true and
observed score equatings using IRT, 4) the practical advantages to be gained
from using IRT equating rather than traditional equating in an operational
testing program, and 5) the distinction made by Rentz and Bashaw (1977) between
test development and test analysis activities, and why the distinction is
important in discussions of equating.

Background Information 

Formal Requirements for Equating

Angoff (1971) has delineated, in the context of conventional equating
methods, the basic requirements of equating; Lord (1977, 1980) has restated and
elaborated upon these requirements in a form that is both illuminating and
amenable to a consideration of IRT methods. These requirements will be
discussed because they have a good deal of influence on what we realistically
should expect the equating process to be able to do. According to Angoff
(1971), there are four restrictions or requirements to be met by the equating
process: 1) the instruments in question should measure the same attribute, 2)
the resulting conversion should be independent of the data used in deriving it
and be applicable in all similar situations, 3) scores on the two forms should,
after equating, be interchangeable in use, and 4) the equating should be symmetric,
or the same regardless of which form is designated as the base. Angoff (1971)
goes on to note that equating, and the issue of unique conversions, can only
be addressed when the test forms are parallel, and cites the definition of
parallelism given by Gulliksen (1950):

     "Two tests may be considered parallel forms if,
     after conversion to the same scale, their means,
     standard deviations, and correlations with any and
     all outside criteria are equal."

A number of comments can be made that should prove useful for the
discussion that follows. One, while the first restriction requires that the two
forms measure the same attribute, it is not stipulated that the attribute be
unidimensional. While there are certain psychometricians, most notably
Lumsden (1960, 1976), who question whether measurement is meaningful for
non-unidimensional content domains, unidimensionality is nowhere specified
in Angoff's equating requirements. Unidimensionality, or a close approximation
to it, will be a formal requirement of IRT equating methods, meaning that
somewhat tighter restrictions on the nature of the test data must be met
for IRT applications. Two, the independence of the conversions from the data
used for deriving them falls short in practice anytime the groups taking the
forms are not randomly equivalent samples from the population for which the
conversions are to be relevant. This is, in fact, the usual situation in
equating, where frequently non-random groups, often differing in ability, take
the forms to be equated. Three, as pointed out by Angoff (1971), the criterion
of interchangeability of scores only holds when the forms are equally reliable.
Angoff also discusses the process of score calibration, which can be used for
test forms of differing reliability. The calibrated forms can still be
referenced to the same scale, but (theoretically) not used interchangeably.



Lord (1977, 1980) has further clarified the above restrictions, and in
doing so, has pointed out the theoretical advantages to be gained from using
IRT instead of traditional equating methods. Lord's (1977) formal definition
of equating reflects in greater detail Angoff's third requirement, called
the equity requirement:

     "Transformed scores y* and raw scores x can be
     called 'equated' if and only if it is a matter of
     indifference to each examinee whether he is to take
     test X or test Y."

Under this definition, 1) tests measuring different traits or abilities can't
be equated (comparable to Angoff's first restriction), 2) raw or observed
scores on unequally reliable tests can't be formally equated (Angoff's third
restriction), but also 3) observed scores on tests of varying difficulty
cannot be equated. Lord (1977) states:

     "If tests X and Y are of different difficulties,
     the relation between their true scores is necessarily
     nonlinear, because of floor and ceiling effects. If
     two tests have a non-linear relation, it is implausible
     that they should be equally reliable for all subgroups
     of examinees. This leads to the awkward conclusion that,
     strictly speaking, observed scores on tests of different
     difficulty cannot be equated."

Lord (1980) shows further that while the equity requirement can be met
for perfectly reliable or infallible test data (i.e., "true" scores), for
observed score data the equity requirement can be met only if the two forms
are truly parallel (i.e., equivalent item by item), in which case equating
would not be necessary in the first place.

While the above would seem to build the case that in theory observed
score equating is not possible under any circumstances, in practice this is
not true. Lord (1980) has noted that in many practical situations, different
forms of the same test have been developed to be sufficiently parallel that
traditional procedures yield good results. There will be problems in practice,
however, anytime the test forms to be equated are not of the same difficulty
(i.e., vertical equating situations) and observed scores are to be used. It
is for this reason, and also to satisfy Angoff's restriction two (the
conversions should be independent of the groups used to obtain them), that IRT
methods have great appeal for the solution of equating problems.

Basic Principle of IRT Equating

The basic underlying property of IRT that makes it useful for equating
applications is as follows. If the data being considered for the equating fit
the assumptions of an IRT model, it is possible to obtain an estimate of an
examinee's ability that is independent of the subset of items (test form) that
the examinee responds to. Hence, it does not matter if an examinee takes an
easy or hard form of a test; his/her ability estimates obtained from both forms
will be identical, within sampling error, once the parameter estimates are
placed on the same scale. Therefore the differences in difficulty of the forms
being taken are no longer a concern. Further, if one is willing to use the
ability (θ) metric for score reporting purposes, IRT eliminates the need
for equating test forms. All that remains to be addressed is the placing of
parameter estimates, derived from independent calibrations, on the same scale.
This linking process will be described in the next section.
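
For reference, the three-parameter logistic model referred to throughout this
paper is conventionally written as below. The notation is the standard one and
is assumed here rather than taken from the text: a_i is the discrimination,
b_i the difficulty, and c_i the lower asymptote of item i, and D = 1.7 is the
usual scaling constant. The one-parameter (Rasch) model is commonly regarded as
the special case with c_i = 0 and a common discrimination for all items.

```latex
P_i(\theta) \;=\; c_i + \frac{1 - c_i}{1 + \exp\!\bigl[-D\,a_i\,(\theta - b_i)\bigr]},
\qquad D = 1.7
```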

For a variety of reasons, large scale testing programs are often unable
to report scores using the ability metric, and instead most continue to report
scaled scores in a traditional manner even though IRT has been used for equating
purposes. (Tests specifically developed using IRT procedures don't usually
suffer the same problem and often use a variety of direct transformations of the
ability metric; see Wright, 1977.) At ETS, the reason for continuation of the
use of traditional scaled scores is that the scales existed long before IRT
equating was considered, and the scales have properties that are accepted and
understood by examinees. Fortunately, because any value of θ can be
mathematically related to estimated true scores on the two forms, a situation exists
whereby IRT equating of these estimated true scores can be utilized and traditional
scaled scores reported. Further, Lord (1980) points out that the three
requirements of the equating process, equity, invariance across groups, and symmetry,
which are not met when observed scores are equated, are met when true (perfectly
reliable) scores are equated. Hence, test forms of decidedly different difficulties
can be equated if true scores are used, and further, the groups no longer have
to be random in order to derive an equating relationship that is invariant
across groups (from the same population). This has prompted Lord (1977) to say
that "...conventional equating methods are not strictly appropriate when
non-parallel tests having a non-linear relationship are administered to
non-equivalent groups."

While the equating of IRT-derived true scores would seem to solve a
number of equating problems that have been discussed, it should be noted that
in practice we work with true score estimates, not the true scores, which
remain unknown values. Lord (1980) has pointed out:

     "...however, an estimated true score does not
     have the properties of true scores; an estimated
     true score, after all, is just another kind of
     fallible observed score."

While the above is true, what is important to note is that observed scores and
estimated true scores are somewhat different fallible scores, incorporating
different kinds of error. Further, by selecting items that fit the IRT model
and calibrating on large enough samples, we can insure that our true score
estimates are sufficiently close to the actual true values so as to derive the
important benefits of the equating; this is not so easily done with observed
scores. In sum, while the estimated true score equating will not be perfect,
it will offer much more in problem equating situations (i.e., test forms
varying greatly in difficulty) than can be derived from conventional observed
score equating.

The IRT Equating Process

IRT equating can be viewed simplistically as a two step process. Assuming
that an IRT model has been chosen, the first step involves choosing an equating
design and then dealing with the problem of getting parameter estimates from
separate calibration runs within this design on the same scale. (When using
certain computer programs such as LOGIST, it is often the case that all parameter
estimation can be accomplished in a single calibration run.) The second step
involves performing the actual equating; if a program can report scores on the
ability metric, the equating has been accomplished. However, because many
testing programs report scores on some other scale, which is a transformation of
the raw score scale, the second step becomes necessary.

There are essentially three equating designs used in IRT equating, and
these designs are analogous to the most frequently used conventional designs.
These designs are referred to as the 1) single group, 2) random groups, and 3)
anchor test design. In the single group design, the same group takes both
test forms to be equated. Because the same group takes both forms, differences
in test difficulty are not confounded by differences in group abilities, and
because of this, conventional methods work quite well, provided the forms are
not of grossly differing difficulties. In the random groups design, two
randomly selected groups each take a different form of the test. If the groups
are truly random groups (from the same population), they should be at equivalent
ability levels, and once again, differences in test form difficulty will not be
confounded by ability differences, and conventional methods should work well
unless the forms are of grossly differing difficulties. In the third design,
two different groups of examinees take two different forms of a test; each form
either contains a common set of items or a common anchor test is given with the
forms. This is perhaps the most frequently used design for both horizontal and
vertical equating situations. The groups do not have to be random, and more
often they are not; if conventional methods are used, the common items are used
to adjust for ability differences in the two groups. Depending both on the
differences in difficulty of the forms and on the nature of the samples, this
adjustment may or may not be effective, and hence, for this design, IRT equating
can be seen as a very attractive alternative.

As a means of clarifying the need for a separate step to place parameter
estimates on the same scale, consider the following situation which, while not
characteristic of a situation encountered in equating applications, is quite
instructive. Suppose the same set of items is given to two different groups of
examinees, and the parameters for these items are estimated twice, once in one
group and then separately in the other. Because the item characteristic curves
are supposedly independent of the groups used to derive them, the expectation
would be that the two sets of item parameter estimates would be identical,
except for sampling error; this is not so. When item and ability parameters are
estimated simultaneously in the three-parameter logistic model, to ensure
convergence in the estimation procedure, ability parameter estimates are placed
on a scale with an arbitrarily chosen mean and standard deviation. The mean
ability is usually set to zero and the standard deviation to one, and the item
parameter estimates (only difficulty and discrimination) are adjusted accordingly.
If the two groups differ in ability level, the item parameter estimates will
also differ. There will, however, be a linear relationship between item
difficulties (or the θ's, which are on the same metric) estimated in the two groups,
and this relationship can be used to place all parameter estimates on the same
scale.
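
The linear relationship described above can be sketched in code. The example
below uses the mean-and-sigma method, which is an assumption made for
illustration: it is one common way to estimate the slope A and intercept B from
common-item difficulties, and is not necessarily the procedure given in
Appendix A. Function names are hypothetical. Under the transformation
θ* = Aθ + B, difficulties become b* = Ab + B, discriminations become a* = a/A,
and the lower asymptote c is unchanged, so that a*(θ* - b*) = a(θ - b) and the
item characteristic curves are preserved.

```python
from statistics import mean, pstdev


def mean_sigma_constants(b_source, b_target):
    """Estimate A and B so that A * b_source + B matches b_target in mean and SD.

    b_source and b_target are difficulty estimates for the SAME common items,
    obtained from two separate calibration runs.
    """
    A = pstdev(b_target) / pstdev(b_source)
    B = mean(b_target) - A * mean(b_source)
    return A, B


def rescale_item(a, b, c, A, B):
    """Place one item's (a, b, c) estimates on the target scale."""
    return a / A, A * b + B, c


def rescale_ability(theta, A, B):
    """Place an ability estimate on the target scale."""
    return A * theta + B
```

Because the constants are estimated from fallible difficulty estimates, the
recovered A and B carry sampling error; the sketch ignores this and any
weighting or removal of poorly estimated items.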

It should be clearly understood that when all items are administered to a
single group of examinees and the parameters are estimated simultaneously, the
item parameters are on a common scale. When this is not the case, i.e., when
different sets of items are administered to the same group of examinees and
calibrated separately, when the same set of items is given to different
groups of examinees, or when different sets of items are administered to
different groups of examinees, the item parameter estimates for the
three-parameter logistic model are not on a common scale and must be adjusted. This
adjustment is possible only for the following three situations: 1) different
sets of items are administered to the same group of examinees (common people are
available), 2) the same set of items is administered to different groups of
examinees (common items are available), or 3) some items that are the same
(anchor test) and some items that are different are administered to different
groups of examinees (again common items are available, but only a subset of the
total). Situations one and three are characteristic of those encountered in
practical IRT equating applications using the single group and anchor test
design. Situation two might be encountered when comparing parameter estimates
from pretest data with parameter estimates from operational form data. Appendix
A of this paper describes in greater detail the procedures used for placing item
parameter estimates on the same scale for the above three situations using the
three-parameter and also the one-parameter logistic model. Also contained in
this Appendix is an outline which delineates the placing of the parameter
estimates on the same scale for the three equating designs discussed above.

As mentioned earlier, if a testing program is unable to report ability
estimates to examinees, it is possible to translate any value of θ to
corresponding estimated true scores on the two forms and use these estimated true
scores as equated scores. This procedure is described in detail in Appendix B.
It is also possible to use the estimated true scores to generate a frequency
distribution of estimated number right observed scores on the two test forms.
These scores may then be equated using traditional equipercentile methods. It
should be noted that while the θ's estimated separately for two test forms
share a linear relationship even if the forms are quite different in difficulty,
the relationship between the estimated true scores will certainly be non-linear
if the forms differ in difficulty. The same will be true of the relationship
evidenced in the equating of the estimated observed score frequency distributions.
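
A minimal sketch of the estimated true score equating just described follows.
It is an illustration under stated assumptions, not the procedure of Appendix
B: items are represented as (a, b, c) tuples already placed on a common scale,
the three-parameter logistic model with scaling constant D = 1.7 is assumed,
and the function names and the bisection search bounds are hypothetical choices
made for the example.

```python
import math


def p3pl(theta, a, b, c, D=1.7):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def true_score(theta, items):
    """Estimated number-right true score: sum of item probabilities at theta."""
    return sum(p3pl(theta, a, b, c) for a, b, c in items)


def theta_for_true_score(score, items, lo=-8.0, hi=8.0, tol=1e-8):
    """Invert the (monotone increasing) true score function by bisection."""
    if not true_score(lo, items) < score < true_score(hi, items):
        # Scores at or below the sum of the c parameters have no matching theta.
        raise ValueError("true score is outside the attainable range")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if true_score(mid, items) < score:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def equate_true_score(score_x, items_x, items_y):
    """Form X true score -> theta -> corresponding form Y true score."""
    theta = theta_for_true_score(score_x, items_x)
    return true_score(theta, items_y)
```

Applying equate_true_score over a grid of form X scores traces out the
equating curve, which is non-linear whenever the two forms differ in
difficulty, as noted above.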

Because of the special nature of the Rasch model, it is possible to use
the ability estimates obtained from a parameter estimation program to directly
equate the actual observed scores. Like the other methods, this method is
also not without its problems. The procedure is described in more detail in a
section of Appendix B, as are the problems involved with the procedure.

Practical Advantages of Using IRT Equating

Besides the theoretical advantage offered earlier for using IRT equating
methods, i.e., it is the only reasonable method to use when tests or test forms
of differing difficulty are given to non-random groups of differing abilities,
there are also a number of practical advantages to be gained through the use of
IRT. These include:

     1.  Improved equating, including better equating at the end of the scale
         where important decisions are often made. As mentioned before, it is
         possible to equate estimated true scores for all values of θ, not
         just those actually obtained from the data.

     2.  Greater test security through less dependence on items in common with
         a single old form. If old forms of tests have calibrated items on
         the same scale, the common item block can come from multiple old
         forms.

     3.  Easier re-equating should items be revised or deleted. Presently,
         when traditional equating methods are used, if there are revisions or
         deletions of a substantial nature, the revised form must be
         readministered for equating purposes. If IRT equating of estimated true
         scores is used, the estimated true score for the revised test can be
         gotten by simply summing over the P_i(θ) for those items left in the
         revised form.

     4.  The possible reduction of bias or scale drift which may occur in
         equating situations when traditional methods are used over time,
         most notably when the equating samples from the old and new forms are
         not random samples (from the same population). This will be discussed
         further in a later section of this paper.

     5.  The possibility of pre-equating, or deriving the relationship between
         the test forms before they are administered operationally. This is
         possible only when pre-test data is available. The use of IRT for
         pre-equating offers a unique contribution that can't be derived using
         traditional methods.
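
The re-equating shortcut in item 3 can be stated compactly. In the notation
assumed here (not given in the text), T(θ) is the estimated true score on the
original n-item form and R is the set of items retained in the revised form:

```latex
T(\theta) = \sum_{i=1}^{n} P_i(\theta),
\qquad
T_{\text{rev}}(\theta) = \sum_{i \in R} P_i(\theta)
```

Because the item parameters are already on a common scale, the revised true
score function follows from the retained items alone, with no readministration
of the form.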



Test Construction and Test Analysis

In discussing the problem of model-data fit for the Rasch model, Rentz
and Bashaw (1975, 1977) delineated the differences between test construction
and test analysis activities, a distinction that will prove most useful in
clarifying when IRT equating methods are more advantageous than traditional
methods. In test construction activities, the IRT model, in conjunction with
content specifications, is used as a guide for selecting items on the test.
Items that fit the model poorly can be discarded, and items of moderately
poor fit can be modified. Rentz and Bashaw (1977) state: "Thus, for this
application, indications of model-data fit are necessary for items, the
presumption being that the final collection of items will include only those
that meet whatever criteria for fit might be established." For purposes of a
discussion of equating, in this context, test construction would mean that the
test to be equated and the base test have IRT parameter estimates for items
that fit the model well or moderately well before equating is even considered.

In the test analysis situation, the final test form is fixed and badly
fitting items can't be discarded. "Rather, the objective in this case is to
derive whatever benefits the model is robust enough to provide, under potentially
less-than-ideal item fit conditions" (Rentz and Bashaw, 1977). For equating
purposes, test analysis activities would refer to fitting an IRT model to
already existing new and base test data so that equating can be facilitated
through the use of IRT methods. It would seem reasonable, however, to consider
fitting an IRT model for equating purposes only if the IRT method offered
something over any of the non-IRT equating procedures. If conventional
procedures are deemed adequate, and nothing additional can be derived from IRT
procedures, then going to the expense of an IRT equating and dealing with the
problems of non-fitting items can be justified only in the weakest sense by
the fact that it can serve as a check on the conventional equating.

Discussion Section

When should IRT equating methods provide better results than traditional methods,
and when should traditional methods suffice?

In answering this question, three distinctions are useful. These are 1)
whether the equating is being done in a test construction or test analysis
mode, 2) whether the tests or test forms to be equated differ greatly in
difficulty (this is the usual horizontal-vertical equating distinction,
although it is possible to have test forms at the same level which differ
greatly in difficulty), and 3) what the nature is of the samples taking the
tests or test forms: are they random groups from the same population? If they
are non-random, do they differ greatly in the ability being measured?

If the test forms to be equated have been specifically designed or
constructed using IRT test development procedures, then IRT methods should be used
for equating. It would prove impractical to throw away useful parameter
information and equate using traditional methods. While it is true that the
traditional methods will work well if the tests do not differ greatly in
difficulty and the groups in ability, IRT procedures "protect" from the problems
encountered when this is not the case. The IRT equating methods should work
tolerably well across all combinations of differences in test difficulty and
group ability. Choice of specific IRT model for equating will be dictated by the
choice of the model used in the actual test construction process.

If the test forms have been assembled using standard test development
procedures (the test analysis mode), then IRT equating should be considered
only in those instances where traditional methods do not work well. These
instances include 1) vertical equating situations, where tests differing in
difficulty are given to groups of differing abilities, or 2) horizontal equating
situations where test forms of differing difficulty are given to non-random
groups that may differ in ability (the usual anchor test design). Further, if
the test forms do not differ greatly in difficulty but the groups are non-random
groups from the same population, conventional methods, while working tolerably
well, will not ensure that the equating results are generalizable to other
groups for whom the forms are appropriate. IRT equating methods, used in this
instance, will ensure generalizability.

In an attempt to clarify those instances in which IRT equating should
provide better results than traditional methods, entries have been placed in the
following two-way table:

                                      Equating
    Activity                Horizontal               Vertical

    Test Construction       IRT                      IRT

    Test Analysis           IRT or Conventional*     IRT

    * The choice of method should be determined through a consideration of the
      differences in difficulty of the test forms, the differences in ability
      of the groups, and the necessity for generalizable equating results.

As substantiation for the above generalizations, a number of research
studies can be cited. Lord (1975), in comparing traditional and IRT equating
for the three basic equating designs, found good correspondence between traditional
and IRT equatings for tests not differing widely in difficulty when using the
single group and random groups designs, where differences in ability level are
not an issue. Lord (1975) did find, however, substantial differences between
conventional and IRT equating for tests differing in difficulty given to
non-equivalent groups when using an anchor test approach. Marco, Petersen, and
Stewart (1979) also found that IRT methods were superior to traditional methods
when tests of differing difficulty were equated using an anchor test approach.
A number of researchers (Beard and Pettie, 1979; Golub-Smith, 1980; Rentz and
Bashaw, 1975, 1977) have confirmed the fact that traditional and IRT equatings
correspond well when a horizontal equating of test forms is done, even when the
test forms were not specifically developed to fit a particular IRT model. These
researchers have been working with the Rasch model, and while the results are
encouraging in terms of suggesting the Rasch model is robust in equating situations,
from a practical standpoint, the fact that the methods behave similarly suggests
continued use of conventional methods unless some additional benefits accrue
from the IRT equating.

When IRT methods provide better results, which IRT model should be used?

Substantial recent research sheds some light on which IRT model to use
when performing vertical equating in test analysis situations. Slinde and
Linn (1978, 1979), Loyd and Hoover (1980), and Kolen (1981) have demonstrated,
using either direct equating or indirect techniques, that the Rasch model is
probably inappropriate for the vertical equating of tests not specifically
designed to fit the model. Gustafsson (1979a, 1979b) has pointed out one
reason for the failure of the Rasch model in this situation. When guessing
behavior is present in the item responses for tests being vertically equated,
a negative correlation results between traditional item difficulty and item
discrimination indices. Since item difficulties are bound to differ for the
forms, the negative correlation forces the discriminations to vary also, thereby
bringing to test the equal item discrimination assumption of the Rasch model.
While the results of the study by Loyd and Hoover (1980) also demonstrate a
problem with the Rasch model for vertical equating situations, these authors are
concerned that because the nature of the content specifications for the test
changes appreciably with level, there may be a problem of unidimensionality
across levels that is causing the failure of the Rasch model. The issues raised
by Gustafsson and Loyd and Hoover have implications as to whether the
three-parameter logistic model should be better than the Rasch model for vertical
equating. If, as pointed out by Gustafsson (1979b), the item discriminations
vary across forms due to the existence of guessing, the three-parameter logistic
model, which can handle variation in item discriminations and also guessing,
should prove useful. If, however, the problem is one of dimensionality, as Loyd
and Hoover (1980) point out, no unidimensional IRT model can solve the problem.
Further, while certain studies (Kolen, 1981; Marco, Petersen, and Stewart, 1979)
point to a superiority of the three-parameter logistic model in vertical equating
situations, there is always the problem of deciding on a criterion upon which to
judge which method is superior. The results at present do seem to suggest,
however, that the three-parameter logistic model offers a more viable alternative
for the vertical equating of approximately unidimensional tests.
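The contrast between the two models can be made concrete with a minimal sketch of their item characteristic curves. The function names and the scaling constant D = 1.7 are illustrative assumptions, not notation taken from the studies cited above. The sketch shows that the Rasch curve is the special case of the three-parameter curve with a common discrimination and no guessing, which is why varying discriminations or a nonzero lower asymptote strain the Rasch model.

```python
import math

def icc_3pl(theta, a, b, c, D=1.7):
    """Three-parameter logistic item characteristic curve.

    a = discrimination, b = difficulty, c = pseudo-guessing (lower asymptote).
    D = 1.7 is the customary scaling constant (an assumption of this sketch).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def icc_rasch(theta, b):
    """Rasch curve: every item shares one discrimination and c = 0."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# With a fixed so that D * a = 1 and with c = 0, the 3PL curve reduces to Rasch.
same = abs(icc_3pl(0.5, 1 / 1.7, -0.2, 0.0) - icc_rasch(0.5, -0.2)) < 1e-12

# With guessing, a very low ability examinee still succeeds with probability near c.
floor = icc_3pl(-10.0, 1.0, 0.0, 0.2)
```

Here `same` is true and `floor` is only slightly above 0.2, a value the Rasch curve cannot reproduce for low abilities.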

In the horizontal equating of test forms in test analysis situations, IRT
methods should be considered when the test forms differ somewhat in difficulty
and the groups are non-random and non-equivalent in nature, which usually
occurs with anchor test designs. The results of the Marco, Petersen, and
Stewart study (1979) suggest that, for test forms that differ in difficulty
developed from the same set of content specifications, the three-parameter
logistic model is superior for equating purposes. Kolen (1981) has pointed out,
however, (as did Marco et al.) that the criterion for judging the superiority of
equating methods in their study may have been biased against certain of the
methods.

For the horizontal and vertical equating of test forms that have been
specifically constructed to fit an IRT model, the choice of model for equating
follows automatically from the choice of model in the test construction process.
Little has been specifically written, however, about which IRT model should
prove superior in test construction activities for horizontal and vertical
equating situations. The comments that follow are gleaned from the research
done on the vertical and horizontal equating of tests in a test analysis mode,
with the hope that these results generalize to test construction activities. It
would appear that for test forms developed from the same set of test
specifications, either the Rasch or three-parameter logistic model can be used in
the test construction process. Of course, the added assumptions of the Rasch model,
equal item discriminations and no guessing, must be dealt with, but if the
developer has reasonable flexibility to choose fitting items and still meet the
original (or slightly revised) content specifications, the Rasch model is
viable. In fact, it would be to the developer's best interest to use the Rasch
model whenever possible because of the measurement consequences that result.

When tests or test forms are being developed to purposely test at different
levels, however, the nature of the content specifications must also change
somewhat across levels (see Slinde and Linn, 1977), and because of this fact, it
will be a much more difficult task to prepare items that measure the content
specifications, are at a difficulty level appropriate for the level being
tested, and at the same time, are equally discriminating across all levels. It
should be noted that if this is not possible, certain researchers (Lumsden,
1978; Wood, 1978) would say that there is a dimensionality problem. According
to Lumsden (1978), "Test scaling models are self-contradictory if they assert
both unidimensionality and different slopes in the item characteristic curves."
A similar conclusion may result, however, from purely content considerations.
Is it reasonable to expect the assumption of unidimensionality to underlie a set
of test forms designed to test individuals at grossly different levels of
ability? In sum, the issue in vertical test construction situations may not
ultimately be whether the three-parameter logistic model is more viable than the
Rasch model, but whether any IRT model is appropriate. This of course is an
equally reasonable question to pose for vertical equating in test analysis
situations.



What unique contributions can IRT methods offer the equating process? 

There are at least three situations in which IRT methods can make a
unique contribution to the process of test equating; that is, an equating can
be accomplished that would have been either impossible or of minimal utility
when using conventional methods.

The first of these situations involves the pre-equating of test forms.
Pre-equating refers to the process of establishing equating conversions between
a new form and a base form or forms prior to the time the new form is
administered. The process depends on the adequate pretesting of a pool of items
from which the new test form will be built, the calibration of these items
using IRT methods, and the utilization of a linking scheme to place the IRT
parameters from the pretested items all on the same scale and also on the same
scale as the old form(s). The process of pre-equating is presently under
investigation at ETS because at least three very important outcomes accrue
from the process. One, IRT-based pre-equating is unaffected by the possible
future problem of revealing common item equating sections under disclosure
legislation because there would be no need for these sections in the first
place. Two, since equating using IRT pre-equating methods is possible prior
to the actual administration of the test, new test forms can be introduced at
low volume administrations, a particular problem if conventional methods had
to be used. Three, pre-equating removes the equating process from the score
reporting cycle (the period from the time the test is administered to the time
scores are reported), thereby minimizing the chance of equating errors and at
the same time freeing up time for other psychometric activities.

A second unique contribution of IRT to the test equating process involves
equating tests that do not contain common items and, at present, can't be
pre-equated. As an example, consider the following. Each October, two forms
of the Preliminary Scholastic Aptitude Test/National Merit Scholarship Qualifying
Test (PSAT/NMSQT) are administered, and for security reasons, the two forms
contain no common items. As a result, the two forms are not equated to each
other, but are both equated to the same two old SAT test forms. Comparability
of scores across the two forms is thus established indirectly through a mutual
relationship with the SAT forms. It would obviously be more desirable to effect
a direct form-to-form equating rather than depend on the indirect equating
presently used. If the data collected at the two administrations can be arranged
as in Figure 1, it is possible, using LOGIST, to estimate all item and ability
parameters in a single computer run. Hence, item parameters for both PSAT/NMSQT
forms will be on the same scale, thus providing a direct equating of the ability
estimates. An equating of estimated true scores or estimated observed score
frequency distributions automatically follows. The results of doing the above
have been reported by Cook, Dunbar, and Eignor (1981).

Figure 1: Calibration Plan for Direct IRT Equating of PSAT/NMSQT Form 1 Verbal
Section to PSAT/NMSQT Form 2 Verbal Section. The entire matrix represents a
single calibration run. Crosses indicate items that examinee groups were
actually exposed to. Each PSAT/NMSQT and SAT sample contains approximately
2,000 cases.
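The single-run arrangement can be illustrated with a minimal sketch of how such a data matrix might be assembled. The group names, item labels, and the use of `None` as a "not reached" code are hypothetical; they do not reproduce the Figure 1 layout or the LOGIST input format. The second function checks the property that makes a common scale possible in a design of this kind: every group must be connected to every other through some chain of shared items.

```python
NOT_REACHED = None  # hypothetical code for items a group was never administered

def build_concurrent_matrix(responses_by_group, all_items):
    """Stack all examinees into one matrix with one column per item.

    responses_by_group maps a group name to a list of dicts (item -> 0 or 1).
    Items missing from an examinee's dict are coded NOT_REACHED.
    """
    rows = []
    for examinees in responses_by_group.values():
        for resp in examinees:
            rows.append([resp.get(item, NOT_REACHED) for item in all_items])
    return rows

def design_is_linked(items_by_group):
    """True if all groups are connected through chains of shared items."""
    groups = list(items_by_group)
    seen = {groups[0]}
    frontier = [groups[0]]
    while frontier:
        g = frontier.pop()
        for h in groups:
            if h not in seen and items_by_group[g] & items_by_group[h]:
                seen.add(h)
                frontier.append(h)
    return len(seen) == len(groups)

# Hypothetical design: two new forms with no items in common, each linked
# to an old form through its own set of shared items.
design = {
    "new_form_1_group": {"n1_a", "n1_b", "link_1"},
    "old_form_group": {"link_1", "old_a", "link_2"},
    "new_form_2_group": {"n2_a", "link_2"},
}
items = sorted(set().union(*design.values()))
matrix = build_concurrent_matrix(
    {"new_form_1_group": [{"n1_a": 1, "n1_b": 0, "link_1": 1}]}, items
)
```

In this hypothetical design the two new forms share nothing directly, yet `design_is_linked(design)` is true because both connect to the old form.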

The final unique contribution of IRT to the equating process involves the
equating of a test comprised of items from a locally developed item bank to a
standardized norm-referenced test that has national norms data. Any test made
up of items from the item bank may then be used in conjunction with the norms
data, provided the items from the bank have been calibrated and placed on the
same underlying scale. The items comprising the test can then be matched to the
measurement need (for instance, pretest or posttest) and the norms data can be
used for evaluation of pupil growth. Holmes (1980) has investigated the above
procedure for use in Title I evaluations, using the one-parameter logistic or
Rasch model. The procedure, based upon what was documented in the Holmes
report, is as follows:

1. A local item bank testing relevant content taught in a district or
   system is developed. An IRT model is fit to the items (the content
   domain must be reasonably unidimensional), based on pre-test data,
   and all the parameter estimates are placed on the same scale.

2. A norm-referenced test which tests comparable content and has
   representative national norms is selected.

3. A test built from the local item bank and the norm-referenced test
   are administered to the same group of examinees.

4. All items from both tests are calibrated together. For a particular
   item bank test score, the equivalent ability estimate can be determined.
   In turn, this estimate and the item parameters for the norm-referenced
   test allow the determination of the equivalent normed test score.
   This is done for the range of item bank test scores, which in total
   comprises an IRT (estimated true score) equating of scores on both
   tests.

5. Each equated normed test score has a percentile rank associated with
   it that can be converted into a Normal Curve Equivalent (NCE) score
   required for Title I evaluation purposes. These percentile ranks can
   be determined through interpolation of the raw score to percentile
   norms table provided with the normed test.

6. The equated item bank test scores are translated into item bank
   ability estimates using the item parameter estimates already in
   existence for all the items from the pre-test data.

7. The end result is a one-to-one correspondence between total item
   bank ability estimates and NCE units, to be used for evaluation
   purposes.

8. Any possible subset of items from the bank selected for a particular
   purpose results in measurement on the common ability metric which can
   be related to the NCE units. Tests that measure relevant local
   content and are peaked to provide maximum information for the examinee
   group can then be developed and administered with the resulting
   measurement of growth on the mandated NCE scale.
The unique aspect of this process is not the equating of a locally developed
test to a nationally normed test (this could be done using conventional methods),
but the equating of the local bank ability scale to the norm-referenced test.
Without having done this, each locally developed test would have to be equated,
rather than the equating being done only once.
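Steps 4 and 5 of the procedure can be sketched as follows, assuming the Rasch model. All function names, item difficulties, and the norms table are hypothetical, and the bisection search is one convenient way of inverting the test characteristic curve, not necessarily the method used in the Holmes report. The NCE conversion uses the standard definition, NCE = 50 + 21.06z, where z is the normal deviate corresponding to the percentile rank.

```python
import math
from statistics import NormalDist

def rasch_p(theta, b):
    """Rasch probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def true_score(theta, difficulties):
    """Estimated number-right true score: sum of item probabilities."""
    return sum(rasch_p(theta, b) for b in difficulties)

def theta_for_true_score(score, difficulties, lo=-8.0, hi=8.0):
    """Invert the test characteristic curve by bisection (it is increasing)."""
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if true_score(mid, difficulties) < score:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def interpolate_percentile(score, norms):
    """Linear interpolation in a sorted (raw score, percentile rank) table."""
    for (x0, p0), (x1, p1) in zip(norms, norms[1:]):
        if x0 <= score <= x1:
            return p0 + (p1 - p0) * (score - x0) / (x1 - x0)
    raise ValueError("score outside the norms table")

def percentile_to_nce(pr):
    """Normal Curve Equivalent for a percentile rank strictly between 0 and 100."""
    return 50.0 + 21.06 * NormalDist().inv_cdf(pr / 100.0)

# Hypothetical calibrated difficulties, all on one scale after step 4.
bank_test = [-1.0, 0.0, 1.0]
normed_test = [-0.5, 0.5, 0.0, 1.0]
norms_table = [(0.0, 1.0), (2.0, 50.0), (4.0, 99.0)]  # invented for illustration

theta = theta_for_true_score(1.5, bank_test)      # ability for a bank score of 1.5
equated = true_score(theta, normed_test)          # equivalent normed test score
nce = percentile_to_nce(interpolate_percentile(equated, norms_table))
```

Repeating the last three lines over the range of item bank scores gives the score-to-NCE correspondence described in steps 4 through 7.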

It should be noted that a major concern expressed by Holmes in the project
report was fit of the data to the Rasch model. While we also share a similar
concern, expressed further in a later section of the paper, nothing precludes
the use of the three-parameter model in Holmes' study. The equating done was not
the actual raw to raw equating through estimated abilities that can be done only
with the Rasch model, but instead, estimated true score equating, which can be
done with any of the models.

What has been done that relates to the confidence that can be placed in IRT
equating results?

The problem involved in evaluating the results of any IRT equating concerns
the criterion measure. Since nobody ever knows what the true equating may be,
i.e., the best criterion against which to judge the results of the actual
equating, other criterion measures have often been devised; these vary in degree
of complexity and in assumptions made. In situations where conventional equating
methods are known to function well or have been in existence for some time, the
results of the conventional method(s) form a criterion against which the IRT
equating may be evaluated (see Lord, 1975; Beard and Pettie, 1979; Rentz and
Bashaw, 1977; Golub-Smith, 1980; Marco, 1977; Woods and Wiley, 1977, 1978). In
other situations, the test itself may form a criterion; that is, the test is
equated to itself (see Lord, 1975, 1977; Marco, Petersen, and Stewart, 1979).
To the extent that the equating results coincide with expectation, one has
confidence in the method. In other situations, one can use stability of equating
rather than accuracy of equating as a criterion measure for evaluative purposes.
Kolen (1981) cross-validated his equating results with random samples of
individuals. More specifically, he formed frequency distributions for his random
cross-validation samples and then compared his equated score frequency
distributions with these; a mean squared difference between scores with identical
percentile ranks was used for evaluative purposes. Loyd and Hoover (1980) formed
a somewhat different criterion against which to evaluate the results of their
study, which involved the use of the Rasch model in vertical equating of forms
given to examinee groups of differing abilities. They equated the same forms
using groups of comparable abilities. A comparison of the two equatings then
allows one to ascertain whether the differences obtained were greater than those
expected from simple sampling differences in parameter estimates obtained for
groups of comparable abilities.
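A stability criterion of the kind attributed to Kolen (1981) can be sketched as below. The percentile grid, the interpolation rule, and the function names are assumptions of this sketch; the description above does not specify how the scores at identical percentile ranks were computed.

```python
def score_at_percentile(sorted_scores, pr):
    """Score at percentile rank pr (0 to 100), by linear interpolation."""
    pos = pr / 100.0 * (len(sorted_scores) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(sorted_scores) - 1)
    frac = pos - lower
    return sorted_scores[lower] + frac * (sorted_scores[upper] - sorted_scores[lower])

def mean_squared_difference(equated, criterion, ranks=range(5, 100, 5)):
    """Mean squared difference between scores with identical percentile ranks."""
    e, c = sorted(equated), sorted(criterion)
    diffs = [(score_at_percentile(e, p) - score_at_percentile(c, p)) ** 2
             for p in ranks]
    return sum(diffs) / len(diffs)

# Hypothetical data: equated scores that sit exactly 2 points above the
# cross-validation sample at every percentile rank.
cross_validation = [10, 12, 15, 18, 20, 22, 25, 27, 30, 33]
equated_scores = [x + 2 for x in cross_validation]
msd = mean_squared_difference(equated_scores, cross_validation)
```

With a constant 2-point discrepancy the criterion equals 4; a smaller value for one equating method than another would indicate closer agreement with the cross-validation sample.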

Another way to gain confidence in IRT or conventional equating results is
through a consideration of the scale drift that occurs when multiple forms of
a test are equated over time. Scale drift will have occurred if the result
of equating Form A to Form D directly is not the same as that obtained by
equating Form A to Form D through intervening Forms B and C. One would have
confidence in the equating method that resulted in the least scale drift. A
problem with the above example is that there is no good way of knowing which
equating method was best for directly equating Form A to Form D. An excellent
way of dealing with this problem is through the use of a circular closed chain,
as depicted in Figure 2. Form V4, which has previously been put on scale, can
be equated to itself through the five intervening forms. Any discrepancy
between the transformation obtained from the circular chain of equatings
and the initial V4 scale could be attributed to scale drift. One would then
have confidence in the equating method that resulted in the least discrepancy
between the initial scale of V4 and the scale resulting from the chain of
equatings. A study comparing scale drift for IRT and conventional equating
methods applied to aptitude test data has been done by Petersen, Cook, and
Stocking (1981). A similar study using achievement test data is presently
being conducted at ETS.

Figure 2: Verbal Aptitude Test Equating Chain Taken from Petersen, Cook, and
Stocking Study (1981). The figure's legend marks the operational verbal test
forms and the common item equating sections.
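The logic of the closed chain can be illustrated with a minimal sketch in which each equating in the chain is represented as a linear conversion, y = slope * x + intercept. That linear form, the function names, and the numbers are assumptions made for illustration; the conversions in the study cited above need not be linear. If no drift had occurred, composing the conversions around the loop would return every score unchanged.

```python
def compose(chain):
    """Compose linear conversions, given as (slope, intercept), in order."""
    slope, intercept = 1.0, 0.0
    for a, b in chain:
        # Applying y = a*x + b after the conversion accumulated so far.
        slope, intercept = a * slope, a * intercept + b
    return slope, intercept

def drift(chain, scores):
    """Difference between each chained score and its starting value."""
    slope, intercept = compose(chain)
    return [slope * x + intercept - x for x in scores]

# Hypothetical closed chains that start and end on the same form.
no_drift_chain = [(2.0, 1.0), (0.5, -0.5)]      # composes to the identity
drifting_chain = [(1.0, 0.5), (1.0, -0.3)]      # leaves a 0.2 point shift
scale_points = [20.0, 50.0, 80.0]

clean = drift(no_drift_chain, scale_points)
shifted = drift(drifting_chain, scale_points)
```

Under these assumptions `clean` is all zeros while `shifted` shows a discrepancy of about 0.2 at every scale point, the kind of quantity one would compare across equating methods.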

In sum, a number of ways have been devised for evaluating IRT and
conventional equating results. These methods can be viewed as practical
solutions to the problem that one never knows what the true or best equating
criterion is in a particular situation.



What are the unresolved issues that relate to IRT equating? 

There are two varieties of unresolved issues involving IRT equating. One
set of issues has to do with the mechanics of IRT equating, and these may be
called direct equating issues. The other set of issues has to do with the use
of IRT in the test construction process, and how this then relates to IRT
equating. These are more indirect issues, such as dimensionality, but they do
influence what can be reasonably expected from an IRT equating. These indirect
issues will be touched upon briefly, and then the more direct equating issues
discussed in some detail.

Most of the IRT test construction work has been done using the Rasch model.
Advocates of using the Rasch model in test construction situations stress that
the most important criterion in deciding upon items for a test is goodness of
fit of the items to the model (Rentz and Rentz, 1978). Recently, two levels of
concern have been voiced reflecting this focus on goodness of fit. Wood (1978)
and Whitely (1977) are concerned that this focus will necessarily restrict
measurement to domains that, while unidimensional, do not necessarily measure
what we really want to measure. Gustafsson (1979a), on the other hand, is
concerned that the usually applied Rasch goodness of fit tests are not sensitive
to multidimensionality among the items, and advocates the application of a
number of other tests sensitive to violations of unidimensionality. Finally,
Wood (1978) has fitted random data to the Rasch model and was not stopped by the
usual goodness of fit tests. While not wanting to enter further into a debate
about the use of goodness of fit tests to construct unidimensional tests, we
shall note from the above that the use of IRT equating in test construction
activities may not be as straightforward as suggested. If the constructed tests
are not unidimensional, then the issue becomes exactly the same as that addressed
in the test analysis mode, namely, how robust is IRT to violations of assumptions
in equating situations. Hence, unless the process of test construction leads to
a unidimensional domain of meaningful content, IRT equating procedures must be
considered in a different light, no longer as a natural outcome of the test
development process.

There are a number of more direct unresolved issues that will be addressed
next. Many of these issues have come to the front in the IRT equating work
that is ongoing at ETS. When using the three-parameter logistic model for
equating, two specific issues have come up. One has to do with the type of
score to be equated when ability estimates cannot be used for reporting purposes.
This is particularly a problem for testing programs that have a long history of
use of a particular scale and forms placed on that scale through conventional
observed formula score equating. When IRT equating is done, should the
relationship between estimated number right true scores, estimated true formula
scores, or estimated number right observed score frequency distributions on the
new and base forms be used to place the new form on scale? Ideally, the
relationship between estimated observed formula score frequency distributions
should be used, but this relationship is unobtainable using IRT methods. The
second issue has to do with which calibration design is best for linking
parameter estimates for which sort of data. As explained in Appendix A, there
are essentially three methods of getting parameter estimates on the same scale
using LOGIST with anchor test designs. Method one, called concurrent
calibration, involves running all the data in one LOGIST run, treating data for
a particular group on the form not taken as not reached. (Figure 1 represents a
concurrent calibration run.) Method two involves fixing the difficulties for the
common items in the second calibration run at the values estimated in the first
calibration run. Method three involves estimating the parameters separately
in two calibration runs and then using the relationship between the difficulty
parameters for the common items to place all parameter estimates on the same
scale. Experimentation at ETS with these methods seems to suggest that no one
method is uniformly best, but that the choice of method seems to vary with the
data set.
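Method three can be sketched as follows, using the means and standard deviations of the common item difficulties from the two separate runs. The function names and the numbers are hypothetical, and the sketch speaks of a "source" run being placed on the scale of a "target" run; it illustrates the general linear linking idea and is not the LOGIST procedure itself.

```python
from statistics import mean, pstdev

def mean_sigma_constants(b_source, b_target):
    """Linear constants A, B such that b_target is approximately A * b_source + B.

    b_source and b_target hold difficulty estimates for the same common items
    from the two separate calibration runs, listed in the same order.
    """
    A = pstdev(b_target) / pstdev(b_source)
    B = mean(b_target) - A * mean(b_source)
    return A, B

def rescale_item(a, b, c, A, B):
    """Move one item's (a, b, c) from the source scale to the target scale."""
    return a / A, A * b + B, c

def rescale_ability(theta, A, B):
    """Move an ability estimate from the source scale to the target scale."""
    return A * theta + B

# Hypothetical common item difficulties from two separate runs.
b_run_1 = [-1.0, 0.0, 1.0]
b_run_2 = [-1.5, 0.5, 2.5]
A, B = mean_sigma_constants(b_run_1, b_run_2)

# A non-common item from run 1, expressed on the run 2 scale.
a_new, b_new, c_new = rescale_item(1.0, 0.0, 0.2, A, B)
```

A useful check on any such transformation is that the quantity a(theta - b), and hence each response probability, is unchanged when item and ability estimates are rescaled together.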

Another issue presently of interest has to do with the one-parameter
logistic model, where essentially two separate IRT equating procedures can be
used. One procedure, used by Rentz and Bashaw (1975) and Loyd and Hoover
(1980), is based on the direct relationship between Rasch model observed scores
and ability estimates. Observed scores on test forms corresponding to the
same ability estimate are considered equated. The other procedure, used by
Kolen (1981), corresponds to that usually used for the three-parameter logistic
model, where there is no direct relationship between observed score and ability.
For any particular ability, knowledge of the item parameter estimates for each
form allows generation of estimated true scores, which can be considered equated
(see Appendix B). From these estimated true scores, frequency distributions of
estimated observed scores may be generated and equated using conventional
equipercentile methods. While the first procedure mentioned above is
straightforward, there is a problem if, for a particular ability level,
corresponding integer raw scores do not exist on the two forms. With the other
procedure, the problem of missing data does not exist because the estimated true
score relationship can be determined for any ability level, not just those
ability estimates derived from the data. Of interest is which procedure would be
best to use in which situation.
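One way of generating an estimated observed score frequency distribution from item parameter estimates is sketched below. The recursion over items is a standard device for the distribution of a number-right score under local independence; it is offered as an illustration and is not claimed to be the computation used in the studies cited. The Rasch response function, the ability values, and all names are hypothetical choices, and any model's item probabilities could be substituted.

```python
import math

def rasch_p(theta, b):
    """Rasch probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def score_distribution(probs):
    """Distribution of the number-right score for one ability level.

    probs holds the probabilities of success on each item; items are
    assumed independent given ability.
    """
    dist = [1.0]  # before any item, a score of 0 is certain
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for score, mass in enumerate(dist):
            new[score] += mass * (1.0 - p)   # item answered incorrectly
            new[score + 1] += mass * p       # item answered correctly
        dist = new
    return dist

def marginal_distribution(thetas, difficulties):
    """Average the conditional distributions over a set of ability estimates."""
    total = [0.0] * (len(difficulties) + 1)
    for theta in thetas:
        dist = score_distribution([rasch_p(theta, b) for b in difficulties])
        for score, mass in enumerate(dist):
            total[score] += mass / len(thetas)
    return total

two_even_items = score_distribution([0.5, 0.5])
form_x = marginal_distribution([-1.0, 0.0, 1.0], [-0.5, 0.0, 0.5, 1.0])
```

Distributions produced this way for the two forms could then be passed to a conventional equipercentile routine, as the paragraph above describes.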

There are a number of other issues of a more general nature that will be
briefly mentioned. One has to do with the demonstration of unidimensionality
for tests being vertically equated. For a variety of reasons, the assumption
of unidimensionality can be violated for tests that are intentionally built to
vary in difficulty, and procedures need to be considered that address this
concern. Another issue has to do with determining which types of test data
IRT equating procedures will work best with and what types are problematic.
While this can be viewed as a dimensionality issue, a robustness issue, or
both, there is more to it. It is conceivable that IRT equating will be of
differential utility for a variety of tests, all of which do not greatly
violate the assumption of unidimensionality. It would be useful to know for
which kinds of tests IRT equating works best and for which it gives the poorest
results. Finally, an issue presently of interest at ETS is how to determine or
establish a base scale when using IRT procedures. As mentioned earlier, for a
variety of tests, ETS is locked into using a previously established scale. The
issue of a new scale would present itself if either a new program were being
introduced or a decision were made to change content specifications on existing
tests to the extent that equating was no longer possible and perpetuation of the
existing scale unreasonable. Should scores then be reported on the ability
metric, some linear transformation of that scale, the estimated true score
scale, or the estimated observed score scale? Of interest is the generation of
arguments in favor of each scale so that an informed decision can be made.
Conclusions

The purpose of this paper was to address, using available research, some
practical issues relating to IRT equating procedures. The outcome of the paper
is most likely that we have brought up more issues yet to be resolved than we
have clarified existing issues. This is undoubtedly due to what is presently
known about IRT equating procedures. Hopefully, as more IRT equating research is
done, the questions posed in this paper will come to be resolved.
Appendix A

Scaling Parameters - Three-Parameter Logistic Model

Situation 1. Two different sets of items (Form X and Form Y) are given
             to the same group of examinees.

a. Calculate: $M_{\theta_X}$, $SD_{\theta_X}$, $M_{\theta_Y}$, $SD_{\theta_Y}$,
   where X and Y designate Forms X and Y and M and SD
   represent the means and standard deviations of the $\theta$'s
   (ability parameters) estimated by the two test forms.

b. If the assumptions of the model are met, the $\theta$'s will
   have the following linear relationship:

   $$\theta_Y = A\,\theta_X + B \qquad (1)$$

   where $A = SD_{\theta_Y} / SD_{\theta_X}$ and $B = M_{\theta_Y} - A\,M_{\theta_X}$.

c. The item parameters are adjusted as follows, which places the
   Form Y estimates on the Form X scale:

   $$c_g^{*} = c_g \qquad (2)$$

   $$a_g^{*} = a_g A \qquad (3)$$

   $$b_g^{*} = \frac{b_g - B}{A} \qquad (4)$$

Situation 2. The same set of items is given to two different groups of
examinees (Group A and Group B).

    a. Calculate: $M_{b_A}$, $SD_{b_A}$, $M_{b_B}$, $SD_{b_B}$,
       where M and SD represent means and standard deviations,
       the subscripts A and B represent groups, and b represents
       the item difficulty parameter.

    b. If the assumptions of the model are met, the b's will
       have the following linear relationship:

           $b_B = A b_A + B$    (5)

       where $A = SD_{b_B} / SD_{b_A}$ and $B = M_{b_B} - A M_{b_A}$.

    c. The item discrimination ($a_g$) and pseudo-guessing
       parameters ($c_g$) as well as ability estimates ($\theta_a$)
       are adjusted as follows:

           $a_g^* = a_g A$    (6)

           $c_g^* = c_g$    (7)

           $\theta_a^* = (\theta_a - B) / A$    (8)
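A short Python sketch may help show why a linear rescaling of this kind is harmless. It estimates A and B from the difficulties of the same items calibrated in two groups, carries one item and one examinee from the Group B scale to the Group A scale, and checks that the three-parameter logistic response probability is unchanged. The scaling constant D = 1.7, the direction of the adjustment (multiplying a by A and applying (value - B)/A to b and to ability), and all sample values are assumptions made for illustration.

```python
import math
from statistics import mean, pstdev

D = 1.7  # logistic scaling constant; an assumption, not stated in the text

def p3pl(theta, a, b, c):
    """Three-parameter logistic item response function."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def difficulty_linking_constants(b_group_a, b_group_b):
    """A and B with b_B = A * b_A + B, from difficulties of the same items."""
    slope = pstdev(b_group_b) / pstdev(b_group_a)
    intercept = mean(b_group_b) - slope * mean(b_group_a)
    return slope, intercept

# Hypothetical difficulty estimates for three items calibrated in each group.
b_a = [-1.0, 0.0, 1.0]
b_b = [-0.2, 0.3, 0.8]
A, B = difficulty_linking_constants(b_a, b_b)   # about 0.5 and 0.3

# One item and one examinee expressed on the Group B scale.
a_on_b, b_on_b, c_on_b, theta_on_b = 1.2, 0.8, 0.2, 0.55

# Carry them to the Group A scale.
a_on_a = a_on_b * A
b_on_a = (b_on_b - B) / A
theta_on_a = (theta_on_b - B) / A

# The response probability is the same on either scale.
p_before = p3pl(theta_on_b, a_on_b, b_on_b, c_on_b)
p_after = p3pl(theta_on_a, a_on_a, b_on_a, c_on_b)
```

The equality of `p_before` and `p_after` follows because the product a(θ - b) is unchanged when a is multiplied by A and both θ and b are shifted by B and divided by A.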

Situation 3. Some items that are the same and some items that are
different are administered to different groups of examinees
(Group A and Group B).

    a. Expressions 5-8 can be used in this situation. Linear
       parameters (A and B) determined from the common items
       given to the two groups of examinees are used to adjust
       all item parameter and ability estimates obtained for
       one of the forms to the scale of the second form.

    b. The following is an alternative method that may be
       used in this situation.¹

         i. Estimate parameters for Form Y and the common items
            using data obtained when the form was given to
            Group B.

        ii. Estimate parameters for Form X and the common items
            using data obtained when the form was given to
            Group A, holding the b values for the common items
            fixed at the estimated values obtained from Group B.

       iii. This procedure ensures that Form X item parameters
            and ability estimates will be on the Form Y scale.

¹ Not all computer programs have the capability of accepting parameter
estimates from a previous run. LOGIST, the computer program used at
ETS, does have this capability.

Scaling Parameters: One-Parameter Logistic Model

Situation 1. Two different groups of items (Form X and Form Y) are given
to the same group of examinees.

    a. Calculate: $M_{\theta_X}$, $M_{\theta_Y}$

    b. Calculate the linking constant, $k = M_{\theta_Y} - M_{\theta_X}$

    c. Adjust all ability parameters estimated by Form X as
       follows:

           $\theta_X^* = \theta_X + k$    (9)

    d. Adjust all Form X difficulty parameters as follows:

           $b_X^* = b_X + k$    (10)

Situation 2. Two different groups of items (Form X and Form Y) along with
a common set of items are given to two different groups of
examinees (Group A and Group B).

    a. Calculate: $M_{b_A}$, $M_{b_B}$,
       where $M_{b_A}$ and $M_{b_B}$ refer to the mean easiness of the
       common items given to the respective groups.

    b. Calculate the linking constant, $k = M_{b_B} - M_{b_A}$

    c. Adjust the Form X item easiness parameters as follows:

           $b_{g_X}^* = b_{g_X} + k$    (11)

    d. Adjust all ability parameters estimated by Form X as
       follows:

           $\theta_X^* = \theta_X + k$    (12)
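The one-parameter case reduces to a single additive constant, which the following Python sketch illustrates for the common-item situation; the same shift applies to expressions 9 and 10 with means of ability estimates in place of means of item parameters. The sketch assumes that Form X was given to Group A, that k is the Group B mean minus the Group A mean over the common items, and that b behaves as a difficulty (so that adding k to both ability and b leaves their difference unchanged). All names and numbers are hypothetical.

```python
import math
from statistics import mean

def rasch_p(theta, b):
    """Rasch probability of a correct response, with b treated as a difficulty."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def linking_constant(common_b_group_a, common_b_group_b):
    """k = (mean of common-item estimates, Group B) - (same mean, Group A)."""
    return mean(common_b_group_b) - mean(common_b_group_a)

# Hypothetical common-item estimates from the two separate calibrations.
common_a = [-0.6, 0.1, 0.5]
common_b = [-0.2, 0.5, 0.9]
k = linking_constant(common_a, common_b)       # about 0.4

# Form X estimates from the Group A run, shifted onto the Form Y scale.
form_x_b = [-1.0, 0.0, 1.2]
form_x_theta = [-0.3, 0.8]
form_x_b_star = [b + k for b in form_x_b]
form_x_theta_star = [t + k for t in form_x_theta]

# The shift leaves every response probability unchanged.
p_before = rasch_p(form_x_theta[0], form_x_b[0])
p_after = rasch_p(form_x_theta_star[0], form_x_b_star[0])
```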

Situation 3. The same set of items is given to two different groups of
examinees (Group A and Group B).

    a. In this case, if one calculates $M_{b_A}$ and $M_{b_B}$ based on all
       the items, they will be equal within sampling error.
       Hence, there is no linking constant; all parameter
       estimates are on the same scale without adjustment.

Equating Designs

A. Single Group Design

    1. Two test forms are given to the same group of examinees.

        a. Conventional methods work very well in this situation.

        b. Simplest approach would be to estimate all item and
           ability parameters in a single computer run.

             i. All item parameters and ability estimates will
                be on the same scale.

            ii. Estimated $\theta$'s obtained from the two forms will be
                identical except for sampling error. If one is
                willing to report ability estimates to examinees,
                no further effort is necessary.

        c. Item parameter and ability estimates could be obtained
           in two separate computer runs.

             i. This would necessitate placing item and ability
                parameter estimates on the same scale. Procedures
                given for Situation 1 could be used for this purpose.

B. Random Groups Design

    1. Two randomly selected groups each take a different form of
       the same test.

        a. Conventional methods work fairly well in this situation.

        b. Assumption is that two groups are equivalent in ability.

        c. Could analyze the data in two separate computer runs and
           use the procedure described in Situation 1 to place
           item and ability parameters on the same scale.

        d. The following procedure could also be used:

             i. Analyze the data in two separate computer runs.

            ii. Obtain a distribution of $\theta$'s for each data set,
                e.g., Form X given to Group A, Form Y given to
                Group B.

           iii. Equate the $\theta$'s obtained from the two runs by
                ordinary equipercentile methods.

C. Anchor Test Design

    1. Two groups of examinees take two different forms of a test,
       but each form contains a common set of items.

        a. Simplest way to accomplish the equating is to estimate
           all item and ability parameters together in a single
           computer run.

        b. The two forms of the test to be equated are considered
           to be one long test consisting of items comprising
           Form X and Form Y.

        c. All of the examinees in both groups (Group A and Group B)
           are assumed to have taken all of the items in both
           test forms; where there are no responses, the items
           are assigned to be not reached.¹

        d. The item parameter estimates for both forms will be on
           the same scale and ability estimates obtained from either
           form will be equivalent.

        e. Suppose item and ability parameters have been obtained
           separately for the test forms administered to their
           respective groups.

             i. Procedures given for Situation 3 could be used
                to place all item and ability parameters on the
                same scale.

            ii. An alternate procedure would be to estimate the item
                and ability parameters for Form Y given to Group B.
                Following this, the item and ability parameters for
                Form X given to Group A would be estimated fixing
                the item difficulty parameters for the common set
                of items contained in Form X at the values previously
                obtained from the Form Y estimation procedure.¹

¹ The computer program LOGIST used at ETS for parameter estimation
has the capability for dealing with these options; other programs
do not. Hence, parameter estimates derived from these programs
must be put on scale using the method described previously in
Situation 3.
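The "one long test" layout used for a single calibration run in the anchor test design can be pictured with a small Python sketch. It stacks both groups' responses into one matrix whose columns are the Form X unique items, the common items, and the Form Y unique items, and fills the items a group never saw with a not-reached code. The column order, the function name, and the use of `None` as the code are assumptions for illustration; they do not describe the input format of LOGIST or any other program.

```python
NOT_REACHED = None  # placeholder code; real programs define their own

def stack_for_concurrent_run(group_a_rows, group_b_rows,
                             n_unique_x, n_common, n_unique_y):
    """Lay out both groups as one long test: [X-only | common | Y-only]."""
    combined = []
    for row in group_a_rows:            # Group A took Form X: X-only + common
        if len(row) != n_unique_x + n_common:
            raise ValueError("Group A row has the wrong number of items")
        combined.append(list(row) + [NOT_REACHED] * n_unique_y)
    for row in group_b_rows:            # Group B took Form Y: common + Y-only
        if len(row) != n_common + n_unique_y:
            raise ValueError("Group B row has the wrong number of items")
        combined.append([NOT_REACHED] * n_unique_x + list(row))
    return combined

# Hypothetical scored responses (1 = right, 0 = wrong).
group_a = [[1, 0, 1]]   # two X-only items, then one common item
group_b = [[1, 1, 0]]   # one common item, then two Y-only items
combined = stack_for_concurrent_run(group_a, group_b,
                                    n_unique_x=2, n_common=1, n_unique_y=2)
```

Only the common-item column is filled for both groups, which is what ties the two sets of estimates to one scale in a single run.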



Appendix B 

Alternatives to Equating Ability Estimates

A. Equating Estimated True Scores

    When reporting $\hat{\theta}$'s is not a viable alternative for a
    testing program, it is possible to use the relationship between
    $\theta$ and true score to obtain equated estimated number right true
    scores.

    1. If Form X and Form Y are both measures of the same ability,
       $\theta$, then their estimated number right true scores can be
       calculated as follows:

           $T_X = \sum_{i=1}^{n_X} P_i(\theta)$    (1)

           $T_Y = \sum_{j=1}^{n_Y} P_j(\theta)$    (2)

       where

           $T_X$ = Form X estimated true score for a given $\theta$,
           $T_Y$ = Form Y estimated true score for a given $\theta$,
           and $P_i(\theta)$, $P_j(\theta)$ are the item response functions for
           items $i$, $i = 1, \ldots, n_X$ (in Form X) and $j$, $j = 1, \ldots, n_Y$
           (in Form Y) respectively, using parameter estimates.

    2. Using expressions 1 and 2, it is possible to find an
       estimated number right true score on Form X that is equivalent
       to an estimated number right true score on Form Y for any
       given $\theta$.
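One way to carry out step 2 is sketched below in Python: evaluate expression 1 to find the ability that produces a given Form X true score, then evaluate expression 2 at that ability. The sketch assumes the three-parameter logistic model with scaling constant D = 1.7, positive discriminations (so that true score increases with ability), item parameters already on a common scale, and a simple bisection search. The function names, search bounds, and item values are hypothetical.

```python
import math

D = 1.7  # logistic scaling constant; an assumption, not stated in the text

def p3pl(theta, a, b, c):
    """Three-parameter logistic item response function."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def true_score(theta, items):
    """Expressions 1 and 2: sum of item response functions at theta."""
    return sum(p3pl(theta, a, b, c) for a, b, c in items)

def theta_for_true_score(target, items, lo=-8.0, hi=8.0, tol=1e-10):
    """Invert the monotone true-score curve by bisection."""
    if not true_score(lo, items) < target < true_score(hi, items):
        raise ValueError("target is outside the attainable true-score range")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if true_score(mid, items) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def equate_true_score(t_x, items_x, items_y):
    """Form Y true score at the ability that yields t_x on Form X."""
    return true_score(theta_for_true_score(t_x, items_x), items_y)

# Hypothetical (a, b, c) triples; Form Y is Form X made 0.3 units harder.
items_x = [(1.0, -0.5, 0.2), (0.8, 0.0, 0.25), (1.2, 0.5, 0.2)]
items_y = [(a, b + 0.3, c) for a, b, c in items_x]

t_y = equate_true_score(2.0, items_x, items_y)
```

Targets at or below the sum of the c values cannot be inverted, which is why the search raises an error there; this is the lower limit on true scores under the three-parameter model.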

    It is also possible to use the parameter estimates to obtain
    equated estimated true formula scores.

    1. Estimated true formula scores are calculated in a manner
       similar to that used to obtain estimated true scores:

           $R_X = \sum_{i=1}^{n_X} P_i(\theta) - \left[\sum_{i=1}^{n_X} Q_i(\theta)\right] / (A - 1)$    (3)

           $R_Y = \sum_{j=1}^{n_Y} P_j(\theta) - \left[\sum_{j=1}^{n_Y} Q_j(\theta)\right] / (A - 1)$    (4)

       where

           $A$ = the number of choices per item,
           $R_X$ = Form X estimated true formula score for a given $\theta$,
           $R_Y$ = Form Y estimated true formula score for a given $\theta$,
           $P_i(\theta)$, $P_j(\theta)$ are as defined previously,
           and $Q_i(\theta)$, $Q_j(\theta)$ are equal to $1 - P_i(\theta)$,
           $1 - P_j(\theta)$, respectively.

    2. As was the case for estimated number right true scores, it
       is possible to find an estimated true formula score on
       Form X that is equivalent to an estimated true formula score
       on Form Y for any given $\theta$. It should be noted, however,
       that in both instances the equations implicitly assume that
       every individual responds to all items; i.e., there are
       no omissions or not reached items.
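The formula-score expressions can be checked numerically with a few lines of Python. The sketch below takes the item response probabilities at a fixed ability as given and applies the rights-minus-wrongs/(A - 1) rule; the function name and the probability values are hypothetical.

```python
def true_formula_score(probs, n_choices):
    """Expected rights minus expected wrongs / (A - 1), given each P_i(theta)."""
    rights = sum(probs)
    wrongs = sum(1.0 - p for p in probs)
    return rights - wrongs / (n_choices - 1)

# Ten five-choice items answered at the 1/A chance rate score about zero.
chance = true_formula_score([0.2] * 10, n_choices=5)

# Ten items answered correctly with certainty score the full ten points.
perfect = true_formula_score([1.0] * 10, n_choices=5)

# A mixed case: T = 1.8 rights expected on three items.
mixed = true_formula_score([0.3, 0.6, 0.9], n_choices=5)
```

When every item is answered, the formula score is a linear function of the number right true score T, namely (A T - n)/(A - 1), so the two kinds of true-score equating differ only by that rescaling.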

B. Equating Estimated Number Right Observed Score Frequency Distributions

    A third possibility is to generate estimated number right
    observed score distributions for Form X and Form Y and to equate
    these observed score distributions using ordinary equipercentile
    equating methods.

    1. The frequency distribution of number-right observed scores
       for a given $\theta$, $f(x \mid \theta)$, is a generalized binomial distribution
       (Kendall and Stuart, 1969, Section 5.10). This distribution
       can be generated by the generating function

           $\prod_{i=1}^{n} (Q_i + P_i t)$    (5)

       in which the coefficient of $t^x$ is $f(x \mid \theta)$.

    2. Using the parameter estimates in $P_i(\theta)$, the estimated
       total group or marginal distribution of number right observed
       scores will be

           $f(x) = \sum_{a=1}^{N} f(x \mid \theta_a)$    (6)

       where $a$ indexes examinees.
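The two steps above, followed by the equipercentile step, can be sketched in Python as follows. The conditional distribution is built by expanding the generating function one item at a time, the marginal distribution sums the conditional distributions over examinees, and the equating matches percentile ranks after spreading each integer score uniformly over a unit interval. The three-parameter logistic form with D = 1.7, the percentile-rank convention, and all names and data are assumptions chosen for illustration, not computations reported in the paper.

```python
import math

D = 1.7  # logistic scaling constant; an assumption, not stated in the text

def p3pl(theta, a, b, c):
    """Three-parameter logistic item response function."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def conditional_distribution(probs):
    """Coefficients of prod(Q_i + P_i * t); entry x is f(x | theta)."""
    dist = [1.0]
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for x, f in enumerate(dist):
            nxt[x] += f * (1.0 - p)      # item answered incorrectly
            nxt[x + 1] += f * p          # item answered correctly
        dist = nxt
    return dist

def marginal_distribution(thetas, items):
    """Expected frequency of each number-right score, summed over examinees."""
    total = [0.0] * (len(items) + 1)
    for theta in thetas:
        probs = [p3pl(theta, a, b, c) for a, b, c in items]
        for x, f in enumerate(conditional_distribution(probs)):
            total[x] += f
    return total

def percentile_rank(freq, x):
    """Percentile rank of x, spreading integer score k over [k - 0.5, k + 0.5]."""
    k = min(max(math.floor(x + 0.5), 0), len(freq) - 1)
    below = sum(freq[:k])
    return 100.0 * (below + (x - (k - 0.5)) * freq[k]) / sum(freq)

def score_at_percentile(freq, pr):
    """Inverse of percentile_rank under the same spreading convention."""
    target = pr / 100.0 * sum(freq)
    cum = 0.0
    for k, f in enumerate(freq):
        if f > 0 and cum + f >= target:
            return (k - 0.5) + (target - cum) / f
        cum += f
    return len(freq) - 0.5

def equipercentile_equate(x, freq_x, freq_y):
    """Form Y score with the same percentile rank as score x on Form X."""
    return score_at_percentile(freq_y, percentile_rank(freq_x, x))

# Hypothetical ability estimates and item parameters on a common scale.
thetas = [-1.0, 0.0, 0.5, 1.5]
form_x = [(1.0, -0.5, 0.2), (0.8, 0.0, 0.2), (1.2, 0.5, 0.2)]
form_y = [(1.0, -0.2, 0.2), (0.8, 0.3, 0.2), (1.2, 0.8, 0.2)]
freq_x = marginal_distribution(thetas, form_x)
freq_y = marginal_distribution(thetas, form_y)
y_equivalent = equipercentile_equate(2, freq_x, freq_y)
```

Because the hypothetical Form Y is uniformly harder than Form X, the Form Y equivalent of a Form X score of 2 comes out below 2.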

C. Which Type of Score is Most Appropriate

    1. Estimated true scores have the following disadvantages:

        a. The possible range for true scores is only from
           $T = \sum_{i=1}^{n} c_i$ (the pseudo chance level) to $T = n$.¹ Many
           testing programs report scores below this level and
           therefore require an equating process that will provide
           conversions for the lower level scores.

        b. A procedure exists for providing these conversions:

             i. Determine the mean and standard deviation of
                scores below chance level on Form X:

                    $M_X = \frac{A}{A-1} \sum_{i=1}^{n} c_i - \frac{n}{A-1}$    (7)

                    $S_X^2 = \left(\frac{A}{A-1}\right)^2 \left[\sum_{i=1}^{n} c_i - \sum_{i=1}^{n} c_i^2\right]$    (8)

                where

                    $M_X$ = the mean of Form X scores below chance level,
                    $S_X^2$ = the variance of Form X scores below chance level,
                    $A$ = the number of choices per item, and
                    $c_i$ = the pseudo-guessing parameter for item $i$.

            ii. Equations 7 and 8 are repeated to determine $M_Y$
                (the mean of Form Y scores below chance level)
                and $S_Y^2$ (the variance of Form Y scores below
                chance level).

           iii. Linear parameters for equating Form X scores
                below chance level to Form Y scores below chance
                level are determined as follows (in expressions 9
                and 10, A and B denote the slope and intercept of
                the linear conversion, not the number of choices):

                    $A = S_Y / S_X$    (9)

                    $B = M_Y - A M_X$    (10)

            iv. This procedure may also be used to determine the
                linear parameters for equating number right true
                scores below chance level. In this case,

                    $M_X = \sum_{i=1}^{n} c_i$ and $S_X^2 = \sum_{i=1}^{n} c_i - \sum_{i=1}^{n} c_i^2$.

    2. The equipercentile equating of estimated number right observed
       scores also has a disadvantage in that conversions obtained
       from this type of equating may not be applicable, in the
       strictest sense, to observed formula scores. (Note that it
       is not possible to generate an estimated observed formula
       score frequency distribution.)

¹ This is, of course, a problem when the three-parameter logistic
model is used.
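The below-chance-level procedure in item 1.b can be sketched in Python under one reading of expressions 7-10. The sketch assumes that an examinee at the pseudo-chance level answers item i correctly with probability c_i, that formula scoring awards 1 for a right answer and -1/(A - 1) for a wrong one, and that the linear conversion uses the ratio of standard deviations as slope. Under those assumptions the mean and variance follow directly from sums of independent item scores. The function names and the c values are hypothetical.

```python
import math

def below_chance_moments_formula(c, n_choices):
    """Mean and variance of formula scores at the pseudo-chance level."""
    n = len(c)
    ratio = n_choices / (n_choices - 1.0)
    m = ratio * sum(c) - n / (n_choices - 1.0)
    var = ratio ** 2 * (sum(c) - sum(ci * ci for ci in c))
    return m, var

def below_chance_moments_number_right(c):
    """Same moments for number right scores (the special case in step iv)."""
    return sum(c), sum(c) - sum(ci * ci for ci in c)

def linear_conversion(m_x, var_x, m_y, var_y):
    """Slope and intercept carrying Form X scores to the Form Y scale."""
    slope = math.sqrt(var_y / var_x)
    return slope, m_y - slope * m_x

# Hypothetical pseudo-guessing estimates for two ten-item, five-choice forms.
c_x = [0.2] * 10
c_y = [0.25] * 10
m_x, var_x = below_chance_moments_formula(c_x, n_choices=5)
m_y, var_y = below_chance_moments_formula(c_y, n_choices=5)
slope, intercept = linear_conversion(m_x, var_x, m_y, var_y)

# A below-chance Form X formula score of -1.0 expressed on the Form Y scale.
y_score = slope * (-1.0) + intercept
```

With c_i = 1/A on every item the mean formula score is zero, which is the usual rationale for the formula-scoring correction.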

D. Using Rasch $\hat{\theta}$ Estimates to Equate Actual Observed Scores

Because of the special nature of the Rasch model (total score is a
sufficient statistic for estimating ability; a monotonic relationship exists
between raw score and estimated ability), it is possible to use the results
of IRT parameter estimation to directly equate actual number right observed
scores. This is the procedure used by Rentz and Bashaw (1975) in performing
the raw score to raw score equatings when applying the Rasch model to the
Anchor Test Study data. It was also used by Loyd and Hoover (1980). The
steps listed below are synthesized from Rentz and Bashaw's (1975) procedure:

    1. It should first be noted that a conversion or scoring table is standard
       output from a Rasch parameter estimation program. This table lists
       for every obtained raw score on the test the corresponding ability
       estimate $\hat{\theta}$.

    2. For the two tests to be equated (X and Y), there will be two conversion
       tables, one relating $x$ to $\hat{\theta}_X$ and the other $y$ to $\hat{\theta}_Y$. Suppose
       further that one of the parameter scaling methods has been used to
       obtain $\hat{\theta}_X^*$, which is now on the same scale as $\hat{\theta}_Y$. (Y is the base
       test.)

    3. For each possible score $y_j$, find the score $x_i$ such that
       $|\hat{\theta}_{y_j} - \hat{\theta}^*_{x_i}|$ is a minimum.

    4. The score $x_i$ that minimizes $|\hat{\theta}_{y_j} - \hat{\theta}^*_{x_i}|$ is the
       equivalent score of $y_j$.



There is a problem with the above procedure which results in what Rentz
and Bashaw (1975) refer to as "assignment error". Assignment error occurs
when it is necessary to assign an examinee a raw score on the equated test
that is most equivalent to a raw score on the base test. Suppose, for instance,
that on the base test a raw score of 10 corresponded to a $\hat{\theta}$ of 2.0, and on the
equated test a raw score of 9 corresponded to a $\hat{\theta}$ of 1.9 and a raw score of 10
to a $\hat{\theta}$ of 2.2. In this case, a raw score of 9 on the equated test would be
taken to be equivalent to a raw score of 10 on the base test because the $\hat{\theta}$ of
1.9 is closest to 2.0. The assignment error would be the difference in these $\hat{\theta}$'s.
Obviously, because of the discrete nature of raw scores, longer tests, having
more raw scores and $\hat{\theta}$'s, will result in fewer assignment errors. However,
the same sort of problem would occur if there were missing data (raw scores).
This speaks to the need for an adequate sample of examinees whose abilities
will cover the range of possible raw scores on the two test forms if this sort
of equating is to be considered.
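The nearest-ability matching and the assignment error example above can be reproduced with a short Python sketch. Each conversion table is represented as a dictionary from raw score to ability estimate, with the equated test's estimates assumed to be already on the base test's scale. The function name and data structure are hypothetical; the three table entries are the ones used in the example.

```python
def equate_raw_scores(base_table, equated_table):
    """Match each base-test raw score to the equated-test raw score whose
    ability estimate is nearest; also report the assignment error."""
    result = {}
    for y, theta_y in base_table.items():
        x = min(equated_table, key=lambda s: abs(equated_table[s] - theta_y))
        result[y] = (x, abs(equated_table[x] - theta_y))
    return result

# Conversion-table entries from the example in the text.
base = {10: 2.0}               # base test: raw score -> ability estimate
equated = {9: 1.9, 10: 2.2}    # equated test, already on the base scale

conversion = equate_raw_scores(base, equated)
equivalent_score, assignment_error = conversion[10]
```

Here the base-test score of 10 is matched to an equated-test score of 9, with an assignment error of about 0.1 on the ability scale.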



